Closed Source Distillation Is A Half-Finished Puzzle
Everyone is distilling models lately. TeichAI does it. I do it. The internet is full of tiny models claiming to be smart because they learned from big models. There is a catch. A big one. Closed source models will never be properly distilled through an API.
I know this sounds like sour grapes from someone who cannot afford API credits. It is not. This comes down to math. This comes down to probability. This comes down to the difference between knowing what someone said and knowing what they almost said.
Distillation without logits is like learning to cook by tasting the food but never seeing the recipe. You might get close. You will never be the chef.
The Secret Sauce: Top 50 Tokens
Proper distillation requires more than the final output text. You need the probability distribution. For every token the teacher model generates, you need to know the top 50 tokens it considered. You need the probabilities. You need the logits.
This tells the student model what is right and what is almost right. It teaches nuance. It teaches the shape of the decision boundary. It teaches the model why "Paris" ranks higher than "London" for the capital of France even though both are cities.
Token: "Paris"
Probability: 0.85
Top 50 Alternatives:
1. London (0.05)
2. Berlin (0.03)
3. Rome (0.02)
...
# This is gold. This is what makes small models smart.
Without this information you are training on completions only. You are teaching the student to mimic the answer. The reasoning process stays hidden.
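To make this concrete, here is a minimal sketch of distillation against a top-k distribution: the student is trained to match the teacher's renormalized top tokens via cross-entropy, not just the single winning token. The function and variable names are mine, not from any library, and the numbers reuse the Paris example above.

```python
import math

def topk_distill_loss(teacher_topk, student_logprobs):
    """Cross-entropy of the student against the teacher's top-k slice.

    teacher_topk: token -> teacher probability (truncated top-k list).
    student_logprobs: token -> student log-probability.
    Both are toy dicts here; names are illustrative, not a real API.
    """
    total = sum(teacher_topk.values())
    loss = 0.0
    for tok, p in teacher_topk.items():
        p /= total  # renormalize the truncated distribution
        loss -= p * student_logprobs[tok]
    return loss

teacher = {"Paris": 0.85, "London": 0.05, "Berlin": 0.03}
# A student that already matches the teacher's shape...
agree = {t: math.log(p / sum(teacher.values())) for t, p in teacher.items()}
# ...versus one that spreads its mass uniformly.
uniform = {t: math.log(1 / 3) for t in teacher}

# Matching the full shape gives a strictly lower loss.
assert topk_distill_loss(teacher, agree) < topk_distill_loss(teacher, uniform)
```

Text-only training is the degenerate case where the teacher distribution collapses to a one-hot label: the "almost right" tokens contribute nothing.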
My Haiku Experience
I am using distillation on my Haiku model. It has 1 million parameters. It used to be silent. It used to output gibberish. Now it speaks. It forms sentences. It occasionally makes sense. This improvement came from distillation with full probability distributions.
The model learned which tokens to pick and which tokens to avoid. It learned the confidence of the teacher. When the teacher was uncertain the student learned to be uncertain. When the teacher was confident the student learned to commit. This nuance is invisible in text-only distillation.
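One way to see the confidence transfer described here: the teacher's uncertainty is just the entropy of its output distribution, and a text-only one-hot label erases it entirely. The numbers below are illustrative, not from any real model.

```python
import math

def entropy(probs):
    """Shannon entropy in nats: the teacher's own uncertainty signal."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A confident teacher distribution versus a hedging one (made-up numbers).
confident = [0.90, 0.05, 0.03, 0.02]
uncertain = [0.30, 0.30, 0.20, 0.20]
# Text-only training collapses both to the same one-hot label.
one_hot = [1.0, 0.0, 0.0, 0.0]

assert entropy(one_hot) == 0.0               # all confidence information gone
assert entropy(confident) < entropy(uncertain)
```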
Haiku still thinks fish are numbers sometimes. Progress remains slow. At least it speaks now.
The API Workaround Problem
People suggest a workaround. Prompt the API 50 times with high temperature. Collect all the different outputs. Use that as a proxy for the distribution. This approach has problems.
Sampling is random. Probability is exact. Rolling a die 50 times tells you something about the die. It does not tell you the exact weight of each face. The estimates are noisy. Accuracy suffers.
True Distribution: [0.5, 0.3, 0.2]
50 Samples: Might look like [0.4, 0.4, 0.2]
Variance: High
Accuracy: Low
# You are approximating an approximation.
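To put numbers on the die analogy, here is a small sketch of my own, using the assumed true distribution from the example above. The standard error of an empirical frequency from n samples is sqrt(p(1-p)/n), which for 50 samples is already large relative to the values being estimated.

```python
import math
import random

true_dist = [0.5, 0.3, 0.2]  # the distribution the API will not show you
n = 50

# One-sigma noise on each estimated frequency: sqrt(p*(1-p)/n).
std_err = [math.sqrt(p * (1 - p) / n) for p in true_dist]
# For p = 0.5 that is ~0.07, i.e. 14% relative error from noise alone.

# Simulate the workaround once: draw 50 samples, count outcomes.
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(n):
    r = rng.random()
    counts[0 if r < 0.5 else 1 if r < 0.8 else 2] += 1
empirical = [c / n for c in counts]
```

And this noisy estimate costs 50 API calls for a single token position, which is where the 5000-calls-per-response figure below comes from.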
It is also expensive. Fifty API calls per token. A single response with 100 tokens costs 5000 API calls. You will go bankrupt before you finish one dataset. The economics do not work. The math does not work.
My bank account prefers local training. My bank account prefers free gradients. My bank account cries when I mention API keys.
Why Closed Source Fails Here
Closed source providers do not give you logits. They give you text. They give you a finished product. They do not give you the ingredients. This protects their IP. It also prevents proper distillation.
You can distill behavior. You can distill style. You can distill the final answers. The reasoning process requires the probability distribution. Without it you are training a mimic. The thinker remains hidden behind the API wall.
Open weights are not just about freedom. They are about fidelity. You cannot teach a small model to think like a big model if you cannot see how the big model thinks.
The Open Source Advantage
Open weights give you everything. Logits. Probabilities. Hidden states if you want them. You can extract the full distribution for every token. You can train your student model on the exact decision process of the teacher.
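As a sketch of what "extract the full distribution" means in practice: softmax the final-layer logits and slice the top k. The toy vocabulary and helper name below are mine; with a real open-weights model the logit vector would come from the forward pass over the whole vocabulary.

```python
import math

def topk_distribution(logits, k=3):
    """Softmax a logit vector and keep the k most likely tokens.

    `logits` here is a toy token -> logit dict; with real open weights
    it would be the model's final-layer output for one position.
    """
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    probs = {t: e / z for t, e in exps.items()}
    return dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])

logits = {"Paris": 5.0, "London": 2.0, "Berlin": 1.5, "Rome": 1.0, "cat": -4.0}
top3 = topk_distribution(logits, k=3)
assert list(top3) == ["Paris", "London", "Berlin"]
```

This is the exact object the API workaround tries, and fails, to reconstruct by sampling.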
Open source distillation will always be superior. Data quality matters. Data depth matters more. Text is surface level. Logits are deep structure. You need depth for true intelligence transfer.
What This Means For TinyMemoryLM
My Haiku model improved because I had access to the full distributions. Sonnet-2 will use Engram and TeichAI techniques. Opus will benefit from everything I have learned. They will all be open weights. They will all allow proper distillation.
I want people to be able to distill my models properly. I want them to have the logits. I want them to teach their own tiny models to think. This is how the ecosystem grows. Share the recipe. Let everyone cook.
Final Thoughts
Closed source models cannot be properly distilled. You can approximate. You can sample. You can spend thousands on API calls. You will never get the true distribution. You will never get the full picture.
Open weights are the only path to true distillation. If you care about small models being smart you should care about open weights. This comes down to math. This comes down to probability. This comes down to the top 50 tokens that make the difference between gibberish and speech.